{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0",
   "metadata": {},
   "source": [
    "## Training Policies for Gym using PGPE and CoSyNE\n",
    "\n",
    "This example demonstrates how you can train policies using EvoTorch and Gym. To execute this example, you will need to install the subpackages of `gymnasium` via:\n",
    "\n",
    "```bash\n",
    "pip install 'gymnasium[box2d,mujoco]'\n",
    "```\n",
    "\n",
    "This example is based on our paper [1] where we describe the ClipUp optimiser and compare it to the Adam optimiser. In particular, we will re-implement the experiment for the \"LunarLanderContinuous-v2\" environment. "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1",
   "metadata": {},
   "source": [
    "## Defining the Problem\n",
    "\n",
    "To begin with, we will need to create the Problem class. To do this, we will first define the policy we wish to use. All experiments in [1], except \"HumanoidBulletEnv-v0\", used a linear policy. Let's define this as a `torch` module. Additionally, throughout experiments, the presence of a bias was varied, so we'll add that as a parameter to the module so that you can freely play with this parameter."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2",
   "metadata": {},
   "outputs": [],
   "source": [
    "import torch\n",
    "from torch import nn\n",
    "from evotorch.decorators import pass_info\n",
    "\n",
    "\n",
    "# The decorator `@pass_info` used below tells the problem class `GymNE`\n",
    "# to pass information regarding the gym environment via keyword arguments\n",
    "# such as `obs_length` and `act_length`.\n",
    "@pass_info\n",
    "class LinearPolicy(nn.Module):\n",
    "    def __init__(\n",
    "        self, \n",
    "        obs_length: int, # Number of observations from the environment\n",
    "        act_length: int, # Number of actions of the environment\n",
    "        bias: bool = True,  # Whether the policy should use biases\n",
    "        **kwargs # Anything else that is passed\n",
    "    ):\n",
    "        super().__init__()  # Always call super init for nn Modules\n",
    "        self.linear = nn.Linear(obs_length, act_length, bias = bias)\n",
    "        \n",
    "    def forward(self, obs: torch.Tensor) -> torch.Tensor:\n",
    "        # Forward pass of model simply applies linear layer to observations\n",
    "        return self.linear(obs)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3",
   "metadata": {},
   "source": [
    "Now we're ready to define the problem. Let's start with the \"LunarLanderContinuous-v2\" environment."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4",
   "metadata": {},
   "outputs": [],
   "source": [
    "from evotorch.neuroevolution import GymNE\n",
    "\n",
    "problem = GymNE(\n",
    "    env=\"LunarLanderContinuous-v2\",  # Name of the environment\n",
    "    network=LinearPolicy,  # Linear policy that we defined earlier\n",
    "    network_args = {'bias': False},  # Linear policy should not use biases\n",
    "    num_actors= 4,  # Use 4 available CPUs. Note that you can modify this value, or use 'max' to exploit all available CPUs\n",
    "    observation_normalization = False,  # Observation normalization was not used in Lunar Lander experiments\n",
    ")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5",
   "metadata": {},
   "source": [
    "## Creating the searcher\n",
    "\n",
    "With our problem created, we're ready to create the searcher. We're using PGPE and ClipUp with the parameters described in [2]:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6",
   "metadata": {},
   "outputs": [],
   "source": [
    "from evotorch.algorithms import PGPE\n",
    "\n",
    "radius_init = 4.5  # (approximate) radius of initial hypersphere that we will sample from\n",
    "max_speed = radius_init / 15.  # Rule-of-thumb from the paper\n",
    "center_learning_rate = max_speed / 2.\n",
    "\n",
    "searcher = PGPE(\n",
    "    problem,\n",
    "    popsize=200,  # For now we use a static population size\n",
    "    radius_init= radius_init,  # The searcher can be initialised directely with an initial radius, rather than stdev\n",
    "    center_learning_rate=center_learning_rate,\n",
    "    stdev_learning_rate=0.1,  # stdev learning rate of 0.1 was used across all experiments\n",
    "    optimizer=\"clipup\",  # Using the ClipUp optimiser\n",
    "    optimizer_config = {\n",
    "        'max_speed': max_speed,  # with the defined max speed \n",
    "        'momentum': 0.9,  # and momentum fixed to 0.9\n",
    "    }\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7",
   "metadata": {},
   "source": [
    "## Training the policy\n",
    "\n",
    "Now we're ready to train. We'll run evolution for 50 generations, and use the `StdOutLogger` logger to track progress. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8",
   "metadata": {},
   "outputs": [],
   "source": [
    "from evotorch.logging import StdOutLogger\n",
    "\n",
    "StdOutLogger(searcher)\n",
    "searcher.run(50)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9",
   "metadata": {},
   "source": [
    "With our agent trained, it is straight-forward to now visualize the learned behaviour. For this, we will use $\\mu$, the learned center of the search distribution, as a 'best estimate' for a good policy for the environment. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "10",
   "metadata": {},
   "outputs": [],
   "source": [
    "center_solution = searcher.status[\"center\"]  # Get mu\n",
    "policy_net = problem.to_policy(center_solution)  # Instantiate a policy from mu\n",
    "for _ in range(10):  # Visualize 10 episodes\n",
    "    result = problem.visualize(policy_net)\n",
    "    print('Visualised episode has cumulative reward:', result['cumulative_reward'])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "11",
   "metadata": {},
   "source": [
    "## Training with CoSyNE\n",
    "\n",
    "As an alternative, we consider training the policy with the CoSyNE [2] algorithm. We use a configuration close to that used for pole-balancing experiments [2]. For this, we'll use additional evaluation repeats as the algorithm is more sensitive to noise, so to begin with we'll ensure the actors of the previous `GymNE` instance are killed and define a new `GymNE` instance. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "12",
   "metadata": {},
   "outputs": [],
   "source": [
    "problem.kill_actors()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "13",
   "metadata": {},
   "outputs": [],
   "source": [
    "problem = GymNE(\n",
    "    env=\"LunarLanderContinuous-v2\",\n",
    "    network=LinearPolicy,\n",
    "    network_args = {'bias': False},\n",
    "    num_actors= 4, \n",
    "    observation_normalization = False,\n",
    "    num_episodes = 3,\n",
    "    initial_bounds = (-0.3, 0.3),\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "14",
   "metadata": {},
   "source": [
    "Defining the algorithm configuration, we aim to keep the overall evaluations-per-generation roughly the same, so use 50 individuals per generation. Additionally, we'll keep 1 elite individual-per-generation, to encourage exploitation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "15",
   "metadata": {},
   "outputs": [],
   "source": [
    "from evotorch.algorithms import Cosyne\n",
    "searcher = Cosyne(\n",
    "    problem,\n",
    "    num_elites = 1,\n",
    "    popsize=50,  \n",
    "    tournament_size = 4,\n",
    "    mutation_stdev = 0.3,\n",
    "    mutation_probability = 0.5,\n",
    "    permute_all = True, \n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "16",
   "metadata": {},
   "source": [
    "Once again running for 50 generations with a `StdOutLogger` attached to output the progress:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "17",
   "metadata": {},
   "outputs": [],
   "source": [
    "StdOutLogger(searcher)\n",
    "searcher.run(50)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "18",
   "metadata": {},
   "source": [
    "And once again we can visualize the learned policy. As `Cosyne` is population based, it does not maintain a 'best estimate' of a good policy. Instead, we simply take the best performing solution from the current population. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "19",
   "metadata": {},
   "outputs": [],
   "source": [
    "center_solution = searcher.status[\"pop_best\"]  # Get the best solution in the population\n",
    "policy_net = problem.to_policy(center_solution)  # Instantiate the policy from the best solution\n",
    "for _ in range(10): # Visualize 10 episodes\n",
    "    result = problem.visualize(policy_net)\n",
    "    print('Visualised episode has cumulative reward:', result['cumulative_reward'])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "20",
   "metadata": {},
   "source": [
    "#### References\n",
    "\n",
    "[1] Toklu, et. al. \"ClipUp: a simple and powerful optimizer for distribution-based policy evolution.\" [International Conference on Parallel Problem Solving from Nature](https://dl.acm.org/doi/abs/10.1007/978-3-030-58115-2_36). Springer, Cham, 2020.\n",
    "\n",
    "[2] Gomez, Faustino, et al. [\"Accelerated Neural Evolution through Cooperatively Coevolved Synapses.\"](https://www.jmlr.org/papers/volume9/gomez08a/gomez08a.pdf) Journal of Machine Learning Research 9.5 (2008)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.16"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
